Evaluate Business Location Using Pedestrian Traffic: Day and Night
Authored by: Barkha Javed, Weiren Kong
Duration: 75 mins
Level: Intermediate
Pre-requisite Skills: Python
Scenario

As a business owner, I want to know how much pedestrian foot traffic occurs around me during the day and night, so that I can evaluate the suitability of the location and hours for my business.

Busy foot traffic in a business area during the day may not mean busy foot traffic at night.

As a business owner, I want to know how much pedestrian foot traffic I get compared to surrounding areas, so that I can assess if it is better to adapt my hours, extend or move locations.

  • Foot traffic flow during the day or night may indicate adapting staff levels during specific hours.

  • Significantly low foot traffic in comparison to other streets may mean adapting the business strategy or moving location.

  • Steady foot traffic from early morning to mid afternoon only may indicate adapting business hours to match.

What this use case will teach you

At the end of this use case you will understand how to:

  • Load and examine the pedestrian sensor location dataset
  • Load and examine the pedestrian counting system - monthly (counts per hour) dataset
  • Use these datasets to explore day and night foot traffic in an area
  • Use visualisations to share results
  • Use machine learning to identify if there are any common patterns in day and night traffic
A brief introduction to the datasets used

The datasets we examine for day and night traffic are listed below:

Pedestrian Counting System - Monthly (counts per hour)

Pedestrian Counting System - Sensor Locations

The monthly hourly counts dataset contains pedestrian traffic collected from sensors since 2009 and is updated monthly. It contains 10 fields, including sensor id, date and time, year, month, day of month, day of week, time of day, sensor name and hourly counts.

The sensor locations dataset contains the details about the sensors detecting the pedestrian traffic. It contains 11 fields, the main ones of interest are sensor description, latitude, longitude, location and direction of reading.
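The date and time field in the monthly hourly counts dataset is stored as text in a 12-hour format such as November 01, 2019 05:00:00 PM. As a small sketch (the sample values mirror rows shown later in this use case), it can be parsed with an explicit format string, which avoids pandas having to infer the format on millions of rows:

```python
import pandas as pd

# Sample values in the format used by the date_time field
raw = pd.Series(["November 01, 2019 05:00:00 PM", "November 02, 2019 05:00:00 AM"])

# %B = full month name, %d = day of month, %I/%p = 12-hour clock with AM/PM
parsed = pd.to_datetime(raw, format="%B %d, %Y %I:%M:%S %p")

print(parsed.dt.hour.tolist())  # hours on a 24-hour clock -> [17, 5]
```

The parsed timestamps give direct access to the hour, date and day of week used throughout this use case.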

Accessing and Loading data
In [1]:
#Libraries to be installed (uncomment as needed)
#The -q flag gives less output from pip
# !pip -q install sodapy
# !pip -q install seaborn
# !pip -q install pandas
# !pip -q install matplotlib
# !pip -q install numpy
# !pip -q install nbconvert
# !pip -q install keyboard
# !pip -q install geopandas
# !pip -q install scikit-learn
# !pip -q install folium
In [2]:
#load libraries
import os
import time

from datetime import datetime
import numpy as np
import pandas as pd
from sodapy import Socrata
import seaborn as sns

from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA

import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib.cm as cm
import matplotlib.colors as colors
from matplotlib import style
style.use('ggplot')

import plotly.graph_objs as go
import plotly.express as px

import warnings
warnings.filterwarnings('ignore')

import folium
from folium.plugins import MarkerCluster

import requests

Load Pedestrian sensor location data

In [3]:
#Pedestrian sensor location data
#This URL is built from the v2 API of the City of Melbourne (CoM) open data platform
url = 'https://data.melbourne.vic.gov.au/api/v2/catalog/datasets/pedestrian-counting-system-sensor-locations/exports/json?limit=-1&offset=0&timezone=UTC'

#Use requests to get json response
r = requests.get(url)
response = r.json()

#Store and output in dataframe
sensor_location = pd.DataFrame(response)
sensor_location.head(5)
Out[3]:
location_id sensor_description sensor_name installation_date note location_type status direction_1 direction_2 latitude longitude location
0 2 Bourke Street Mall (South) Bou283_T 2009-03-30 None Outdoor A East West -37.813807 144.965167 {'lon': 144.96516718, 'lat': -37.81380668}
1 4 Town Hall (West) Swa123_T 2009-03-23 None Outdoor A North South -37.814880 144.966088 {'lon': 144.9660878, 'lat': -37.81487988}
2 6 Flinders Street Station Underpass FliS_T 2009-03-25 Upgraded on 8/09/21 Outdoor A North South -37.819117 144.965583 {'lon': 144.96558255, 'lat': -37.81911705}
3 8 Webb Bridge WebBN_T 2009-03-24 None Outdoor A North South -37.822935 144.947175 {'lon': 144.9471751, 'lat': -37.82293543}
4 10 Victoria Point BouHbr_T 2009-04-23 None Outdoor A East West -37.818765 144.947105 {'lon': 144.94710545, 'lat': -37.81876474}
In [4]:
# Rename header to match the below dataset
sensor_location = sensor_location.rename(columns={'location_id': 'sensor_id'})

# Convert lat and long to float for graphing, and drop the nested location column
sensor_location[['latitude', 'longitude']] = sensor_location[['latitude', 'longitude']].astype(float)
sensor_location = sensor_location.drop('location', axis=1)

# Copy to lat/lon, the column names used by the mapping code below
sensor_location['lat'] = sensor_location['latitude']
sensor_location['lon'] = sensor_location['longitude']

Load Pedestrian monthly hourly count data

In [6]:
# *** IMPORTANT ***
# This dataset is not available through API v2
# It must be downloaded as a CSV which is saved locally
# Please ensure, when merging future code, that the CSV file created by this use case is not left in the directory

from urllib.request import urlopen
from io import BytesIO
from zipfile import ZipFile

url = 'https://data.melbourne.vic.gov.au/api/datasets/1.0/pedestrian-counting-system-monthly-counts-per-hour/attachments/pedestrian_counting_system_monthly_counts_per_hour_may_2009_to_14_dec_2022_csv_zip/'

def download_and_unzip(url, extract_to='.'):
    http_response = urlopen(url)
    zipfile = ZipFile(BytesIO(http_response.read()))
    zipfile.extractall(path=extract_to)
    
# *** PLEASE UNCOMMENT BELOW TO EXECUTE CSV SAVE ***    

# download_and_unzip(url)

# sensor_traffic = pd.read_csv('Pedestrian_Counting_System_Monthly_counts_per_hour_may_2009_to_14_dec_2022.csv')

# sensor_traffic = sensor_traffic.rename(columns=str.lower)

# sensor_traffic.head(10)
Out[6]:
id date_time year month mdate day time sensor_id sensor_name hourly_counts
0 2887628 November 01, 2019 05:00:00 PM 2019 November 1 Friday 17 34 Flinders St-Spark La 300
1 2887629 November 01, 2019 05:00:00 PM 2019 November 1 Friday 17 39 Alfred Place 604
2 2887630 November 01, 2019 05:00:00 PM 2019 November 1 Friday 17 37 Lygon St (East) 216
3 2887631 November 01, 2019 05:00:00 PM 2019 November 1 Friday 17 40 Lonsdale St-Spring St (West) 627
4 2887632 November 01, 2019 05:00:00 PM 2019 November 1 Friday 17 36 Queen St (West) 774
5 2887633 November 01, 2019 05:00:00 PM 2019 November 1 Friday 17 29 St Kilda Rd-Alexandra Gardens 644
6 2887634 November 01, 2019 05:00:00 PM 2019 November 1 Friday 17 42 Grattan St-Swanston St (West) 453
7 2887635 November 01, 2019 05:00:00 PM 2019 November 1 Friday 17 43 Monash Rd-Swanston St (West) 387
8 2887636 November 01, 2019 05:00:00 PM 2019 November 1 Friday 17 44 Tin Alley-Swanston St (West) 27
9 2887637 November 01, 2019 05:00:00 PM 2019 November 1 Friday 17 35 Southbank 2691

Merge Pedestrian sensor location and traffic data

In [7]:
#Parse date_time once, then derive the date and day-of-week columns
parsed_dt = pd.to_datetime(sensor_traffic.date_time)
sensor_traffic['date'] = parsed_dt.dt.date
sensor_traffic['dow'] = parsed_dt.dt.day_of_week

#convert sensor_id to integer
sensor_traffic['sensor_id']=sensor_traffic['sensor_id'].astype(int)
sensor_location['sensor_id']=sensor_location['sensor_id'].astype(int)

# Mesh pedestrian sensor location and foot traffic datasets
sensor_ds = pd.merge(sensor_traffic, sensor_location, on='sensor_id')

#Use the year to split traffic: years before 2020 are treated as before Covid, 2020 onwards as after Covid
sensor_ds['pre2020_hourly_counts'] = np.where(sensor_ds['year']<2020,sensor_ds['hourly_counts'] , 0)
sensor_ds['post2019_hourly_counts'] = np.where(sensor_ds['year']>2019,sensor_ds['hourly_counts'] , 0)

#Add column for day (5am to 5pm) or night (6pm to 4am) traffic
sensor_ds['day_counts']   = np.where(((sensor_ds['time']>4)  & (sensor_ds['time']<18)),sensor_ds['hourly_counts'] , 0)
sensor_ds['night_counts'] = np.where(sensor_ds['day_counts']==0,sensor_ds['hourly_counts'], 0)

sensor_ds.describe()
Out[7]:
id year mdate time sensor_id hourly_counts dow latitude longitude lat lon pre2020_hourly_counts post2019_hourly_counts day_counts night_counts
count 4.128871e+06 4.128871e+06 4.128871e+06 4.128871e+06 4.128871e+06 4.128871e+06 4.128871e+06 4.128871e+06 4.128871e+06 4.128871e+06 4.128871e+06 4.128871e+06 4.128871e+06 4.128871e+06 4.128871e+06
mean 2.367206e+06 2.017557e+03 1.574860e+01 1.147254e+01 2.783812e+01 4.645136e+02 3.001120e+00 -3.781332e+01 1.449616e+02 -3.781332e+01 1.449616e+02 3.653186e+02 9.919500e+01 3.360644e+02 1.284492e+02
std 1.316327e+06 3.542266e+00 8.798729e+00 6.936793e+00 2.051895e+01 7.099215e+02 2.000348e+00 6.473498e-03 8.685007e-03 6.473498e-03 8.685007e-03 6.966499e+02 3.018989e+02 6.724927e+02 3.715865e+02
min 1.000000e+00 2.009000e+03 1.000000e+00 0.000000e+00 1.000000e+00 0.000000e+00 0.000000e+00 -3.782402e+01 1.449297e+02 -3.782402e+01 1.449297e+02 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
25% 1.241622e+06 2.015000e+03 8.000000e+00 5.000000e+00 1.000000e+01 4.100000e+01 1.000000e+00 -3.781874e+01 1.449564e+02 -3.781874e+01 1.449564e+02 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
50% 2.411161e+06 2.018000e+03 1.600000e+01 1.100000e+01 2.400000e+01 1.710000e+02 3.000000e+00 -3.781381e+01 1.449644e+02 -3.781381e+01 1.449644e+02 4.100000e+01 0.000000e+00 2.300000e+01 0.000000e+00
75% 3.508766e+06 2.021000e+03 2.300000e+01 1.700000e+01 4.300000e+01 5.610000e+02 5.000000e+00 -3.781102e+01 1.449668e+02 -3.781102e+01 1.449668e+02 3.900000e+02 4.200000e+01 3.320000e+02 5.700000e+01
max 4.567701e+06 2.022000e+03 3.100000e+01 2.300000e+01 8.700000e+01 1.597900e+04 6.000000e+00 -3.779432e+01 1.449747e+02 -3.779432e+01 1.449747e+02 1.597900e+04 1.443700e+04 1.161200e+04 1.597900e+04

Separate day and night data sets

In [8]:
flag_value = 0
df_day = sensor_ds.query("day_counts > @flag_value")
print('Day info')
df_day.info()  # info() prints its summary directly and returns None

df_night = sensor_ds.query("day_counts == @flag_value")
print('Night info')
df_night.info()
Day info
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2221632 entries, 0 to 4128864
Data columns (total 28 columns):
 #   Column                  Dtype  
---  ------                  -----  
 0   id                      int64  
 1   date_time               object 
 2   year                    int64  
 3   month                   object 
 4   mdate                   int64  
 5   day                     object 
 6   time                    int64  
 7   sensor_id               int64  
 8   sensor_name_x           object 
 9   hourly_counts           int64  
 10  date                    object 
 11  dow                     int64  
 12  sensor_description      object 
 13  sensor_name_y           object 
 14  installation_date       object 
 15  note                    object 
 16  location_type           object 
 17  status                  object 
 18  direction_1             object 
 19  direction_2             object 
 20  latitude                float64
 21  longitude               float64
 22  lat                     float64
 23  lon                     float64
 24  pre2020_hourly_counts   int64  
 25  post2019_hourly_counts  int64  
 26  day_counts              int64  
 27  night_counts            int64  
dtypes: float64(4), int64(11), object(13)
memory usage: 491.5+ MB
Night info
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1907239 entries, 1 to 4128870
Data columns (total 28 columns):
 #   Column                  Dtype  
---  ------                  -----  
 0   id                      int64  
 1   date_time               object 
 2   year                    int64  
 3   month                   object 
 4   mdate                   int64  
 5   day                     object 
 6   time                    int64  
 7   sensor_id               int64  
 8   sensor_name_x           object 
 9   hourly_counts           int64  
 10  date                    object 
 11  dow                     int64  
 12  sensor_description      object 
 13  sensor_name_y           object 
 14  installation_date       object 
 15  note                    object 
 16  location_type           object 
 17  status                  object 
 18  direction_1             object 
 19  direction_2             object 
 20  latitude                float64
 21  longitude               float64
 22  lat                     float64
 23  lon                     float64
 24  pre2020_hourly_counts   int64  
 25  post2019_hourly_counts  int64  
 26  day_counts              int64  
 27  night_counts            int64  
dtypes: float64(4), int64(11), object(13)
memory usage: 422.0+ MB

Separate day and night, and before and after Covid for mapping

In [9]:
#Split the day dataset into before and after Covid-19 for mapping
df = df_day
print(df_day.head(2))

df_daypft_beforecovid = df.loc[df['year']<2020]
df_daypft_aftercovid = df.loc[df['year']>2019]

#get average hourly count for each sensor during the selected period of time
df_daypft_beforecovid_avg = df_daypft_beforecovid[['sensor_id','sensor_description','lat','lon','hourly_counts']]
df_daypft_beforecovid_avg = df_daypft_beforecovid_avg.groupby(['sensor_id','sensor_description','lat','lon',],as_index=False).agg({'hourly_counts': 'mean'})
df_daypft_aftercovid_avg = df_daypft_aftercovid[['sensor_id','sensor_description','lat','lon','hourly_counts']]
df_daypft_aftercovid_avg = df_daypft_aftercovid_avg.groupby(['sensor_id','sensor_description','lat','lon',],as_index=False).agg({'hourly_counts': 'mean'})
         id                      date_time  year     month  mdate       day  \
0   2887629  November 01, 2019 05:00:00 PM  2019  November      1    Friday   
12  2888289  November 02, 2019 05:00:00 AM  2019  November      2  Saturday   

    time  sensor_id sensor_name_x  hourly_counts  ... direction_1  \
0     17         39  Alfred Place            604  ...       North   
12     5         39  Alfred Place              9  ...       North   

    direction_2   latitude   longitude        lat         lon  \
0         South -37.813797  144.969957 -37.813797  144.969957   
12        South -37.813797  144.969957 -37.813797  144.969957   

   pre2020_hourly_counts post2019_hourly_counts day_counts night_counts  
0                    604                      0        604            0  
12                     9                      0          9            0  

[2 rows x 28 columns]
In [10]:
#Split the night dataset into before and after Covid-19 for mapping
df = df_night
print(df.head(2))

df_night_beforecovid = df.loc[df['year']<2020]
df_night_aftercovid = df.loc[df['year']>2019]

#get average hourly count for each sensor during the selected period of time
df_night_beforecovid_avg = df_night_beforecovid[['sensor_id','sensor_description','lat','lon','pre2020_hourly_counts']]
df_night_beforecovid_avg = df_night_beforecovid_avg.groupby(['sensor_id','sensor_description','lat','lon',],as_index=False).agg({'pre2020_hourly_counts': 'mean'})
df_night_aftercovid_avg = df_night_aftercovid[['sensor_id','sensor_description','lat','lon','post2019_hourly_counts']]
df_night_aftercovid_avg = df_night_aftercovid_avg.groupby(['sensor_id','sensor_description','lat','lon',],as_index=False).agg({'post2019_hourly_counts': 'mean'})
        id                      date_time  year     month  mdate     day  \
1  2887684  November 01, 2019 06:00:00 PM  2019  November      1  Friday   
2  2887739  November 01, 2019 07:00:00 PM  2019  November      1  Friday   

   time  sensor_id sensor_name_x  hourly_counts  ... direction_1  direction_2  \
1    18         39  Alfred Place            384  ...       North        South   
2    19         39  Alfred Place            289  ...       North        South   

    latitude   longitude        lat         lon pre2020_hourly_counts  \
1 -37.813797  144.969957 -37.813797  144.969957                   384   
2 -37.813797  144.969957 -37.813797  144.969957                   289   

  post2019_hourly_counts day_counts night_counts  
1                      0          0          384  
2                      0          0          289  

[2 rows x 28 columns]
Examine Pedestrian Traffic

Pedestrian traffic has decreased since 2019; the aim is to understand the patterns of day and night traffic.

In [11]:
#examine pre Covid and post 2019 foot traffic
ds = sensor_ds.groupby("time")[["pre2020_hourly_counts", "post2019_hourly_counts"]].mean()
df = ds.sort_values(by=['time'])
axs = df.plot.line(figsize=(20, 6), color=["#0f9295","orange"])
axs.set_title('Foot Traffic by Time', size=20)
axs.set_ylabel('Average counts', size=14)
plt.show()
In [12]:
#distribution by traffic, by day
pivot = pd.DataFrame(pd.pivot_table(sensor_ds, values=['day_counts','night_counts'], index=['year'], aggfunc=np.mean))
rs = pivot.sort_values(by='year', ascending = False)
axs = rs.plot.line(figsize=(12, 5), color=['orange','#023545'], legend=True);

axs.set_title('Foot traffic by year by day and night', size=20)
axs.set_ylabel('Average counts', size=14)
plt.show()
In [13]:
#examine pre Covid and post 2019 foot traffic
ds = sensor_ds.groupby(["dow", "day"])[["pre2020_hourly_counts", "post2019_hourly_counts"]].mean()
df = ds.sort_values(by=['dow'])
axs = df.plot.bar(figsize=(12, 4), color=["#0f9295","orange"])
axs.set_title('Foot Traffic by Day of Week - All Hours', size=20)
axs.set_ylabel('Average hourly counts', size=14)
plt.show()

The top 20 locations by foot traffic overall are displayed below.

In [14]:
#distribution by traffic, by day
pivot = pd.pivot_table(sensor_ds, values='day_counts', index=['sensor_id','sensor_description'], aggfunc=np.mean)
pivot_ds = pivot['day_counts'].nlargest(n=20)
pivot_ds.plot.bar(figsize=(12, 5), color='orange', legend=True);

#by night
pivot = pd.pivot_table(sensor_ds, values='night_counts', index=['sensor_id','sensor_description'], aggfunc=np.mean)
pivot_ds = pivot['night_counts'].nlargest(n=20)
axs = pivot_ds.plot.bar(figsize=(12, 5), color='#023545', legend=True);

axs.set_title('Top 20: Foot Traffic by location by day and night', size=20)
axs.set_ylabel('Average counts', size=14)
plt.show()

We will investigate if the traffic volumes by location are different depending on day and night traffic.

Day Economy

Compare day traffic before Covid and now

In [15]:
#day traffic before Covid
pivot = pd.pivot_table(df_day, values='pre2020_hourly_counts', index=['sensor_id','sensor_description'], aggfunc=np.mean)
pivot_ds = pivot['pre2020_hourly_counts'].nlargest(n=20)
pivot_ds.plot.line(figsize=(12, 5), color='#0f9295', legend=True, rot=90);

#day traffic after Covid
pivot = pd.pivot_table(df_day, values='post2019_hourly_counts', index=['sensor_id','sensor_description'], aggfunc=np.mean)
pivot_ds = pivot['post2019_hourly_counts'].nlargest(n=20)
axs = pivot_ds.plot.bar(figsize=(12, 5), color='orange', legend=True, rot=90);

axs.set_title('Top 20: Foot Traffic by location by day', size=20)
axs.set_ylabel('Average counts', size=14)
plt.show()

The chart above shows how significantly traffic has changed from 2020 onwards for the top 20 locations.

We can see changes in the traffic at each sensor before and after Covid on the map below. Click the layers icon in the top right corner of the map and select the layer to see traffic before Covid.

In [16]:
#Visualise day data
m = folium.Map(location=[-37.8167, 144.967], zoom_start=15)# tiles='Stamen Toner'
locations = []
for i in range(len(df_daypft_beforecovid_avg)):
    row =df_daypft_beforecovid_avg.iloc[i]
    location = [(row.lat,row.lon)]*int(row['hourly_counts'])
    locations += location
marker_cluster  = MarkerCluster(
  name='day traffic before covid-19',
  locations=locations,
  overlay=True,
  control=True,
  color='red',
  show=False # hide this layer by default; select it in the layer control to display the points
  )
marker_cluster.add_to(m)

#next layer
locations = []
for i in range(len(df_daypft_aftercovid_avg)):
    row =df_daypft_aftercovid_avg.iloc[i]
    location = [(row.lat,row.lon)]*int(row['hourly_counts'])
    locations += location
marker_cluster  = MarkerCluster(
  name='day traffic after covid-19',
  locations=locations,
  overlay=True,
  control=True,
  ) 
marker_cluster.add_to(m)
folium.LayerControl().add_to(m)
m
Out[16]:
Make this Notebook Trusted to load map: File -> Trust Notebook

Clustering: Day Traffic

Discover whether there are any patterns in day traffic; the day hours used are 5am to 5pm.

Locations that are similar in terms of traffic may be able to use similar solutions to improve pedestrian traffic.

In [17]:
#identify class and features
use_list = ["sensor_id","lat","lon","hourly_counts"]
loc_day = pd.DataFrame(df_day[use_list])
print(loc_day.head(5))

codes = loc_day[['sensor_id']] #class
loc_day.drop('sensor_id', axis=1, inplace=True)

loc_day.head(5)
    sensor_id        lat         lon  hourly_counts
0          39 -37.813797  144.969957            604
12         39 -37.813797  144.969957              9
13         39 -37.813797  144.969957             28
14         39 -37.813797  144.969957             53
15         39 -37.813797  144.969957             99
Out[17]:
lat lon hourly_counts
0 -37.813797 144.969957 604
12 -37.813797 144.969957 9
13 -37.813797 144.969957 28
14 -37.813797 144.969957 53
15 -37.813797 144.969957 99
In [18]:
#examine data
loc_day.describe()
Out[18]:
lat lon hourly_counts
count 2.221632e+06 2.221632e+06 2.221632e+06
mean -3.781333e+01 1.449617e+02 6.245708e+02
std 6.468232e-03 8.673052e-03 8.125889e+02
min -3.782402e+01 1.449297e+02 1.000000e+00
25% -3.781874e+01 1.449564e+02 1.020000e+02
50% -3.781381e+01 1.449644e+02 2.880000e+02
75% -3.781102e+01 1.449668e+02 8.020000e+02
max -3.779432e+01 1.449747e+02 1.161200e+04
In [19]:
# identify optimal clusters 
features_list = ["lat","lon","hourly_counts"]
df_features = pd.DataFrame(loc_day[features_list])

#scale the features
mms = MinMaxScaler()
data_transformed = mms.fit_transform(df_features)  # day hours

# for each k value calculate sum of squared distances to the nearest cluster centre
Sum_of_squared_distances = []
K = range(1,10)
for k in K:
    km = KMeans(n_clusters=k)
    km = km.fit(data_transformed)
    Sum_of_squared_distances.append(km.inertia_)
    
#plot results
plt.plot(K, Sum_of_squared_distances, 'g*-')
plt.xlabel('Number of Clusters')
plt.ylabel('Sum of squared distances')
plt.title('Elbow Method')
plt.show()

For day traffic the elbow suggests 3 as the optimal number of clusters.
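The elbow is a visual judgement call. As a supplementary check (not part of this use case), the silhouette score can be compared across candidate values of k on scaled features; the sketch below uses synthetic data as a stand-in for the scaled (lat, lon, hourly_counts) features.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for the scaled features: three groups of sensors
# with distinct locations and traffic levels
X = np.vstack([
    rng.normal(loc=[0.2, 0.2, 0.1], scale=0.03, size=(200, 3)),
    rng.normal(loc=[0.5, 0.6, 0.5], scale=0.03, size=(200, 3)),
    rng.normal(loc=[0.8, 0.3, 0.9], scale=0.03, size=(200, 3)),
])
X = MinMaxScaler().fit_transform(X)

# Mean silhouette per candidate k: closer to 1 means tighter, better separated clusters
scores = {k: silhouette_score(X, KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X))
          for k in range(2, 7)}
best_k = max(scores, key=scores.get)
print(best_k)
```

On the real scaled day-traffic features, replace the synthetic X with data_transformed; the k with the highest silhouette score should corroborate (or challenge) the elbow reading.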

In [20]:
# run k-means clustering on the scaled features (matching the elbow analysis), define centroids
kmeans = KMeans(n_clusters=3, random_state=0).fit(data_transformed)
centroids = kmeans.cluster_centers_

codes['cluster'] = kmeans.labels_
#codes.head()
In [21]:
df = df_features

#visualise data
x_axis = df['lon']
y_axis = df['lat']
cluster_legend = kmeans.labels_.astype(int)
plt.figure(figsize=(8, 5))
sns.scatterplot(x=x_axis, y=y_axis, hue=cluster_legend, palette='colorblind', s=60)
plt.title('Clustering Pedestrian Counts: Day Hours')
plt.show()
The exploratory data analysis on day and night traffic, showing the initial findings, is available in the use case eda compare pedestrian traffic day night.

There are three clusters in the day traffic. They indicate different levels of traffic common among sensors located in the City of Melbourne. We can use these findings to investigate what makes the traffic activity at these locations similar.

Sensor locations that are similar in terms of pedestrian counts may have similar solutions that can be applied to increase activity.

As next steps for this component, we can add additional factors impacting traffic and experiment with other clustering techniques, to improve cluster detection and investigate the impact of these factors.

The factors to consider are many, and are discussed in the use case on factors impacting pedestrian traffic.
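As one illustration of an alternative clustering technique (an assumption about those next steps, not something this use case implements), a density-based method such as DBSCAN groups sensors without fixing the number of clusters in advance and flags sparse outliers as noise. A minimal sketch on synthetic stand-in features:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(1)
# Synthetic stand-in for scaled (lat, lon, hourly_counts) sensor features:
# two dense groups of sensors
X = np.vstack([
    rng.normal(loc=[0.2, 0.2, 0.2], scale=0.02, size=(150, 3)),
    rng.normal(loc=[0.7, 0.7, 0.8], scale=0.02, size=(150, 3)),
])
X = MinMaxScaler().fit_transform(X)

# eps sets the neighbourhood radius, min_samples the density threshold;
# points in no dense region receive the noise label -1
labels = DBSCAN(eps=0.1, min_samples=5).fit_predict(X)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters)
```

Because eps and min_samples depend on the scale and density of the data, they would need tuning before applying this to the real scaled traffic features.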

Night Economy

Compare night traffic before Covid and now

In [22]:
#distribution by traffic
#by night before covid
pivot = pd.pivot_table(df_night_beforecovid, values='pre2020_hourly_counts', index=['sensor_id','sensor_description'], aggfunc=np.mean)
pivot_ds = pivot['pre2020_hourly_counts'].nlargest(n=20)
pivot_ds.plot.bar(figsize=(12, 5), color='#0f9295', legend=True);

#by night after Covid
pivot = pd.pivot_table(df_night_aftercovid, values='post2019_hourly_counts', index=['sensor_id','sensor_description'], aggfunc=np.mean)
pivot_ds = pivot['post2019_hourly_counts'].nlargest(n=20)
axs = pivot_ds.plot.bar(figsize=(12, 5), color='#023545', legend=True);

axs.set_title('Top 20: Foot Traffic by location by night, before and after Covid', size=20)
axs.set_ylabel('Average counts', size=14)
plt.show()

The top 20 locations for night traffic are different from the day traffic locations.

View the different layers showing before Covid and after Covid hourly traffic for night locations on the map below.

In [23]:
#Visualise night data
m = folium.Map(location=[-37.8167, 144.967], zoom_start=15)
locations = []
for i in range(len(df_night_beforecovid_avg)):
    row =df_night_beforecovid_avg.iloc[i]
    location = [(row.lat,row.lon)]*int(row['pre2020_hourly_counts'])
    locations += location
marker_cluster  = MarkerCluster(
  name='Night traffic before Covid-19',
  locations=locations,
  overlay=True,
  control=True,
  show=False # hide this layer by default; select it in the layer control to display the points
  )
marker_cluster.add_to(m)
locations = []
for i in range(len(df_night_aftercovid_avg)):
    row =df_night_aftercovid_avg.iloc[i]
    location = [(row.lat,row.lon)]*int(row['post2019_hourly_counts'])
    locations += location
marker_cluster  = MarkerCluster(
  name='Night traffic after Covid-19',
  locations=locations,
  overlay=True,
  control=True,
  ) 
marker_cluster.add_to(m)
folium.LayerControl().add_to(m)
Out[23]:
<folium.map.LayerControl at 0x167dcfb80>
In [24]:
m
Out[24]:
Make this Notebook Trusted to load map: File -> Trust Notebook

Clustering: Night Traffic

Discover whether there are any patterns in night traffic; the night hours used are 6pm to 4am.

In [25]:
#identify class and features
use_list = ["sensor_id","lat","lon","hourly_counts"]
loc_night = pd.DataFrame(df_night[use_list])
#print(loc_night.head(5))

codes = loc_night[['sensor_id']] #class
loc_night.drop('sensor_id', axis=1, inplace=True)

loc_night.head(5)
Out[25]:
lat lon hourly_counts
1 -37.813797 144.969957 384
2 -37.813797 144.969957 289
3 -37.813797 144.969957 158
4 -37.813797 144.969957 114
5 -37.813797 144.969957 115
In [26]:
# identify optimal clusters - these may be the same or different to day
features_list = ["lat","lon","hourly_counts"]
df_features = pd.DataFrame(loc_night[features_list])

#scale the features
mms = MinMaxScaler()
data_transformed = mms.fit_transform(df_features)  # night hours

# for each k value calculate sum of squared distances to the nearest cluster centre
Sum_of_squared_distances = []
K = range(1,10)
for k in K:
    km = KMeans(n_clusters=k)
    km = km.fit(data_transformed)
    Sum_of_squared_distances.append(km.inertia_)
    
#plot results
plt.plot(K, Sum_of_squared_distances, 'g*-')
plt.xlabel('Number of Clusters')
plt.ylabel('Sum of squared distances')
plt.title('Elbow Method')
plt.show()

For night traffic we will try using 3 clusters

In [27]:
# run k-means clustering on the scaled features (matching the elbow analysis), define centroids
kmeans = KMeans(n_clusters=3, random_state=0).fit(data_transformed)
centroids = kmeans.cluster_centers_

codes['cluster'] = kmeans.labels_
#codes.head()
In [28]:
df = df_features

#visualise data
x_axis = df['lon']
y_axis = df['lat']
cluster_legend = kmeans.labels_.astype(int)
plt.figure(figsize = (8,5))
axs = sns.scatterplot(x=x_axis, y=y_axis, hue=cluster_legend, palette='colorblind', s=60)
plt.title('Clustering Pedestrian Counts: Night Hours')
plt.show()
Congratulations

You've successfully used Melbourne Open Data to visualise day and night pedestrian traffic in and around the City of Melbourne!

For next steps please explore the City of Melbourne Open Data playground, such as stepping through the other use cases on pedestrian traffic.

Try the use case on finding a new business location with pedestrian foot traffic and survey data, or the use case on evaluating pedestrian traffic in one location over a day, week and month when starting a new business venture.

References

City of Melbourne Open Data Team 2014-2021, 'Pedestrian Counting System - Monthly (counts per hour)', City of Melbourne, retrieved 11 Aug 2022, https://dev.socrata.com/foundry/data.melbourne.vic.gov.au/b2ak-trbp

City of Melbourne Open Data Team 2014-2021, 'Pedestrian Counting System - Sensor Locations', City of Melbourne, retrieved 26 Aug 2022, https://data.melbourne.vic.gov.au/Transport/Pedestrian-Counting-System-Sensor-Locations/h57g-5234

Carlini L 2019, 'Clustering and Visualisation using Folium Maps', Kaggle, retrieved 23 Sep 2022, https://www.kaggle.com/code/lucaspcarlini/clustering-and-visualisation-using-folium-maps/notebook#Folium-Maps-Visualisation-by-Number-of-Occurences-and-Clustering

In [29]:
#Save the notebook first, so the HTML conversion step below writes the latest results to file
# may need to adapt for other OS, this is for Windows
# keyboard.press_and_release('ctrl+s')

# !jupyter nbconvert  usecase_evaluate_business_location_using_pedestrian_traffic_day_night.ipynb --to html